Microbial pathogen genomes - 
new strategies for identifying 
therapeutics and vaccine targets 



Advances in high-throughput DNA-sequencing techniques have given us the 
unprecedented ability to rapidly determine the nucleotide sequences of entire 
bacterial genomes. The application of these methods to the genomes of microbial 
pathogens, combined with efficient analytical tools and genome-scale approaches 
for studying gene expression, is revolutionizing our approach to the selection of 
targets for drug screening and vaccine development. This is bringing new life to 
this important, but long-neglected, field of research. 



The decision, several years ago, by the US Department 
of Energy, the National Institutes of Health (NIH) and 
several international funding agencies to embark upon 
programs to map and sequence the human genome 
has led to a number of important technological 
advances that are beginning to have an impact in other 
areas of biology. Among these advances are the 
development of automated methods for the generation of 
large amounts of raw DNA-sequencing information, 
computer software for rapidly processing and analyzing 
primary sequence data, and techniques for the 
rapid assembly of shotgun sequencing reads, even from 
entire bacterial genomes. Efficient algorithms for 
similarity searching allow the rapid identification of 
protein-encoding sequences that are homologous to other 
genes, the sequences of which are held in public and 
private databases; as of April 1996, approximately 
500 megabases (Mb) of nucleotide sequence were 
contained in GenBank, and approximately 200 000 
sequences were held in the SWISS-PROT/GenPept/PIR 
database of non-redundant proteins. Combined 
with the wealth of biochemical information that 
is archived in public databases, these tools make it 
possible to describe rapidly the full repertoire of genes 
in a microbial genome, and to predict many of the 
metabolic pathways that an organism may utilize. 

Progress in this field has been stimulated by the 
interest of the biotechnology and pharmaceutical 
industries in using genome-sequencing data as a basis for 
drug discovery. In turn, this has led to the 
development of proprietary databases containing genomic 
information, which provide the basis for in silico 
experiments to identify novel targets for drugs, and for 

D. R. Smith is at Genome Therapeutics Corporation, 
100 Beaver Street, Waltham, MA 02154, USA. 



laboratory experiments to identify genes that perform 
critical functions. This article summarizes some recent 
developments in this important area, focusing on 
bacterial sequences, and provides examples to illustrate 
how genome-sequencing information from microbial 
pathogens can be used to select targets for vaccine and 
drug development. The overall process used to 
proceed from sequence generation to target validation is 
illustrated in Fig. 1. 

Large-scale sequencing of bacterial genomes 

Many laboratories use automated sample-preparation 
techniques and fluorescence-based gel readers 
[such as those produced by Applied Biosystems Inc. 
(ABI; Foster City, CA, USA)] for the large-scale 
sequencing of bacterial genomes. These instruments 
have the advantage that they are efficient, and relatively 
easy to set up and operate. A few laboratories use 
computer-assisted multiplex sequencing to achieve the 
same end (Ref. 1). In multiplex sequencing, samples 
consisting of pools of up to 20 plasmids are processed through 
sample preparation and gel electrophoresis, and the 
resulting sequences are determined from electroblots 
of the gels by hybridization with radioactive or 
fluorescently labeled probes. This technique can be used to 
generate 40 films (or digitized images) from each 
sequencing gel. Although multiplex sequencing is 
efficient at producing large amounts of 'shotgun' data, it 
is more difficult to set up and operate in the 
laboratory than is fluorescence-based gel sequencing, and it 
is not suited to directed-finishing strategies. ABI 
machines are used in the author's laboratory to 
generate primer-directed reads for finishing and gap closure. 

During the past year, a group at The Institute for 
Genomic Research (TIGR; Gaithersburg, MD, USA) 
reported the complete sequences of Haemophilus 
influenzae (1.8 Mb), a major cause of respiratory 
infections and meningitis, especially in children (Ref. 2), and of 
Mycoplasma genitalium (0.6 Mb), which causes 
urethritis (Ref. 3). Approximately 1.6 Mb of contiguous 
sequence from the 4.7 Mb Escherichia coli genome has 
been published (Ref. 4), and the sequencing of a further 2 Mb 
was reported at the 1995 Genome Sequencing and 
Analysis VII (GSA-VII) meeting. The genome of 
Helicobacter pylori (1.7 Mb), the major cause of 
stomach ulcers, has been sequenced by Genome 
Therapeutics Corporation (GTC; Waltham, MA, 
USA) under a privately funded microbial-pathogen 
sequencing program. More than half (1.5 Mb) of the 
2.8 Mb genome of Mycobacterium leprae (the etiologic 
agent of leprosy) has also been sequenced by GTC, 
and is available through GenBank, the GTC web 
site <http://www.cric.com>, and through MycDB 
<http://www.biochem.kth.se/MycDB.html>, which 
contains mycobacterial genome mapping and sequence 
information (Ref. 6). 

Other microbial pathogens that are currently being 
sequenced include Neisseria gonorrhoeae (University of 
Oklahoma, Norman, OK, USA), Streptococcus pyogenes 
(University of Oklahoma), Treponema pallidum 
(University of Texas, Houston, TX, USA, and TIGR), 
Mycobacterium tuberculosis (GTC and the Sanger 
Centre, Hinxton, Cambridge, UK), and Staphylococcus 
aureus [GTC, and Human Genome Sciences (HGS; 
Rockville, MD, USA)]. 

In addition to these pathogens, the genomes of 
several archaebacteria and other non-pathogens are being 
sequenced. These include Methanococcus jannaschii 
(TIGR), Pyrococcus furiosus (University of Utah, Salt 
Lake City, UT, USA), Sulfolobus solfataricus (Dalhousie 
University, Halifax, Nova Scotia, Canada), and 
Pyrobaculum aerophilum (California Institute of 
Technology, Pasadena, CA, USA, and University of 
California, Los Angeles, CA, USA). The 1.7 Mb 
genome of the archaeon Methanobacterium 
thermoautotrophicum is near completion at GTC (Ref. 7). 
Approximately 2 Mb of the 4.1 Mb Bacillus subtilis 
genome has now been sequenced by a consortium of 
European and Japanese laboratories, and the project 
may be completed by the end of 1996 (Ref. 8). 
Approximately 1 Mb of genomic sequence from the 
2.7 Mb genome of the cyanobacterium Synechocystis 
sp. 6803 was recently published (Ref. 9). 

Within the next couple of years, therefore, we can 
expect an explosion of bacterial-genome sequence 
information from species representing a variety of 
phylogenetic lineages, including many pathogens. 

Pharmaceutical companies have shown considerable 
interest in using pathogen genomics to facilitate the 
development of vaccines and small-molecule 
therapeutics. For example, researchers at Glaxo Wellcome 
have sequenced a substantial fraction of the H. pylori 
genome to assist in the process of drug discovery. Over 
the past year, GTC has formed two research alliances 
with pharmaceutical companies to take advantage of 
sequences from microbial pathogens: one with 
Astra AB, focusing on the development of new 
antibiotics and vaccines to treat H. pylori infection, and 
one with Schering-Plough (Union, NJ, USA), to 
develop broad-spectrum antibiotics and vaccines. 
Although the genomic route to drug discovery for 
bacterial pathogens is new and remains unproved, the 
basic paradigm (outlined below) of gene identification, 
followed by functional analysis and drug screening, is 
well established. Thus, it is likely that more companies 
will become involved, and that, in the future, 
additional research alliances between genomics 
companies and the pharmaceutical industry will 
materialize in this area. 

Figure 1 

Flow diagram illustrating the process by which a microbial genome sequence is 
analysed and the information is used to direct experiments and aid in target 
selection for therapeutics development. The individual steps are referred to 
throughout the text. In the case of vaccine candidates, gene products from 
selected targets are expressed and tested in animal models. 



From sequence to genes 

The first task, when confronted with an entire 
bacterial-genome sequence, is to identify all the genes. 
This can be accomplished using a variety of 
techniques, but the most successful approaches use a 
combination of reading-frame and codon-usage analysis, 
together with similarity searching, to identify putative 
genes with homology to previously described 
sequences. Commonly used tools include GeneMark (Ref. 10), 
GenomeBrowser (Ref. 11), BLAST (Ref. 12), and highly 
parallelized implementations of the Smith-Waterman 
alignment, such as BLAZE or MPsrch (Ref. 13). In 
general, organism-specific codon usage is highly 
predictive for bacterial genes, but its effective use depends 
on the existence of sufficient information to generate 
accurate codon-usage matrices. In some cases, subsets 
of genes within an organism will exhibit codon-usage 
patterns that deviate significantly from the norm (Ref. 14). 
Such genes are thought to represent evolutionarily 
recent acquisitions by phage transduction, 
conjugation, or some other form of horizontal transfer from 
other organisms. If enough of these genes are present, 
codon-usage tables of genomic subsets can be 
constructed to identify them. Translational start sites can 
be identified by the occurrence of start codons that 
coincide with abrupt changes in codon usage, the 
initiation of homology to previously characterized 
genes, or the presence of Shine-Dalgarno sequences (Ref. 15). 
Automated analysis tools (such as GenomeBrowser; Ref. 11) 
that provide a graphical display of open reading frames 
(ORFs), codon usage, database homologies and other 
features make the task of identifying bacterial genes 
and their relationships with each other in the genome 
relatively straightforward. With the increasing pace of 
bacterial-genome sequencing, there is an emerging 
need for second-generation tools that will automate 
most of the laborious annotation process. 
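
The combination of reading-frame and codon-usage analysis described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by GeneMark or GenomeBrowser: the function names, the forward-strand-only scan, the ATG-only start rule and the log-odds score against uniform codon usage are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

STOPS = {"TAA", "TAG", "TGA"}


def codons(seq: str) -> List[str]:
    """Split a sequence into complete in-frame codons."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) of ATG...stop ORFs in the three forward frames.

    start is the index of the ATG; end is one past the stop codon.
    Assumption: only ATG starts and only the given strand are considered.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def codon_frequencies(known_genes: Iterable[str]) -> Dict[str, float]:
    """Build a codon-usage table from genes already known in the organism."""
    counts = Counter()
    for gene in known_genes:
        counts.update(codons(gene.upper()))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_usage_score(orf: str, freqs: Dict[str, float],
                      floor: float = 1e-3) -> float:
    """Mean log-odds per codon against uniform usage of the 64 codons.

    Assumption: unseen codons receive a small floor frequency.
    """
    cs = codons(orf.upper())
    return sum(math.log(freqs.get(c, floor) * 64) for c in cs) / len(cs)
```

An ORF whose score is well above zero uses codons in a way that resembles the training genes; a long ORF with a strongly negative score is either spurious or, as noted above, a possible recent horizontal acquisition. The quality of the table depends on how many known genes are available to build it.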

From genes to function 

The second phase in the analysis of bacterial 
genomes is to identify the function of as many genes 
as possible. Currently, sequence homology is the most 
powerful tool. A high degree of homology between 
the putative translation product of a newly identified 
gene and an enzyme whose function has been 
thoroughly studied in other organisms provides strong 
support for the function of that protein, especially if it 
is the only homolog in the genome under scrutiny. 
Other useful tools include programs that identify 
sequence motifs from databases such as PROSITE 
(Ref. 16), BLOCKS (Ref. 17), BEAUTY (Ref. 18) 
and ProDom (Ref. 19). If one is attempting to 
identify vaccine candidates, then highly 
expressed cell-surface proteins are relevant, so it is 
useful to know whether a protein contains a secretion 
signal, even if nothing else is known about it. Although 
the tools described here are very good at identifying 
homologies, 25-40% of the genes in a bacterial 
genome typically fail to show significant similarity 
with known proteins. 

Once the set of similarity-searching tools has been 
exhausted, one must return to molecular biology to 
further elucidate the function and expression pattern 
of predicted genes. Commonly used approaches to 
identifying essential genes in an organism include the 
use of gene knockouts, disruptions using 
transposon-mediated mutagenesis, or homologous recombination 
with disrupted gene-constructs that contain an 
antibiotic-resistance cassette. Gene disruptions can be 
generated in a variety of ways, including sophisticated 
'hit-and-run' approaches that interrupt a gene 
without introducing polar effects into downstream ORFs 
(Ref. 20). However, a gene-by-gene approach to the 
study of a whole genome is certainly time consuming 
and labor intensive. 

The availability of large amounts of 
genome-sequence information has stimulated the development 
of new approaches to functional analysis on a genomic 
scale. This has been particularly true for researchers 
investigating yeast, where a concerted effort is being 
made to ascertain the function of every ORF in the 
genome. Such strategies include the conceptually 
simple, but technologically advanced, technique of 
making microarrays of polymerase chain reaction 
(PCR)-amplified gene sequences on slides to 
allow the fluorescence-based detection of quantitative 
hybridization signals from labeled cDNA probes on 
large numbers of genes simultaneously, perhaps even 
all the genes of an organism (Ref. 21). An ingenious 
PCR-based approach to efficient sequence-signature-based 
expression analysis has recently been demonstrated (Ref. 22). 
For example, a technique termed 'genetic 
fingerprinting' promises to replace individual gene 
knockouts by a global transposon-mutagenesis approach (Ref. 23). 
Insertions are induced en masse in a strain of interest, 
the strain is grown under a variety of conditions, 
and PCR products are analysed to identify genes in 
which transposon hops are under-represented because 
the genes are required for growth (Ref. 23). A conceptually 
similar dropout technique, which uses tagged 
transposons to identify the Salmonella typhimurium genes 
required for virulence in a mouse model, has been 
described (Ref. 24). 
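
The logic of identifying genes in which transposon insertions are under-represented after growth can be illustrated with a short sketch. It assumes that insertion positions are already available as genome coordinates, which stands in for the PCR-product analysis described above; the function names, the gene names in the usage note and the 0.2 ratio cut-off are hypothetical choices for the example.

```python
from bisect import bisect_left
from typing import Dict, Iterable, List, Tuple

Gene = Tuple[int, int]  # half-open interval [start, end) on the genome


def insertions_per_gene(genes: Dict[str, Gene],
                        insertions: Iterable[int]) -> Dict[str, int]:
    """Count transposon insertions that fall inside each gene."""
    positions = sorted(insertions)
    return {
        name: bisect_left(positions, end) - bisect_left(positions, start)
        for name, (start, end) in genes.items()
    }


def under_represented(genes: Dict[str, Gene],
                      before: Iterable[int],
                      after: Iterable[int],
                      max_ratio: float = 0.2) -> List[str]:
    """Flag genes whose insertion count drops sharply after growth.

    Assumption: a gene is flagged when the count after growth is at most
    max_ratio of the count in the starting pool; genes with no insertions
    in the starting pool cannot be assessed and are skipped.
    """
    n_before = insertions_per_gene(genes, before)
    n_after = insertions_per_gene(genes, after)
    return sorted(
        name for name in genes
        if n_before[name] > 0 and n_after[name] / n_before[name] <= max_ratio
    )
```

For instance, with two hypothetical genes `geneA` at (0, 100) and `geneB` at (100, 200), a starting pool with three insertions in each, and a post-growth pool that retains insertions only in `geneB`, the function flags `geneA` as a candidate for a gene required under that growth condition.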

Techniques that probe subsets of genes for a specific 
functionality, such as secretion or induction during 
growth in the host, have also been described. These 
techniques provide clones from which signature 
sequences can be derived, so corresponding 
genes can be identified by comparing them with 
the genomic sequence. The IVET (in vivo expression 
technology) technique, which detects gene fusions 
that result in the in vivo selectable expression of a 
defective purA gene or antibiotic-resistance marker, 
has been used to identify Salmonella genes, the 
expression of which is induced when the pathogen is grown 
in mice (Ref. 25). Finally, protein microsequencing (Ref. 26) and 
mass-spectrometry-based peptide analysis (Ref. 27) have 
been used to identify protein components (e.g. 
outer-membrane proteins) in partially purified mixtures, 
or to identify specific proteins separated by 
two-dimensional gel electrophoresis. Sequences generated 
in this manner can be used to correlate specific 
proteins with the gene sequences from which they are 
expressed. 



Target selection and validation 

The techniques described in the previous section 
can be used to identify genes in specific functional 
categories that may represent good targets for drug or 
vaccine development. In general, when developing 
new antibiotics, one is interested in genes that are 
essential under all growth conditions (and preferably 
even in quiescent cells), and for which inhibitors with 
useful chemical properties, such as permeability and 
low toxicity, can be identified. One advantage of 
having the entire sequence of a genome is that targets 
can be prioritized in terms of their activities and the 
properties of compounds that are known to interact 
with them. Even with the results of knockout or in 
vivo expression experiments, additional biological 
information can aid in winnowing down the field of 
choices. For example, genes can be selected on the 
basis of their probable roles in intracellular 
metabolism. Databases, such as EcoCyc (Ref. 28) or PUMA 
(Ref. 29), that describe known metabolic pathways 
can be helpful in this regard. Detailed structural 
information about homologs of identified genes 
(determined using the Protein DataBank; Ref. 30) can be used to 
assist in the molecular modeling of inhibitors (some 
resources for molecular modeling can be found at 
Ref. 31). 

As more genomes are sequenced, it will become 
possible to identify genes that are unique to a 
particular organism or group of organisms, or genes 
that are conserved in certain groups. Thus, for 
example, it will be possible to use electronic 
comparison to identify genes that are present in H. pylori 
but not in other gut-dwelling bacteria such as E. coli, 
providing a basis for the development of antibiotics 
specific to H. pylori. Although combinatorial 
chemistries promise to speed up our ability to synthesize 
and screen large numbers of unique chemical entities, 
the sequence-based approach described here provides 
an avenue for the rational identification and selection 
of key targets for therapeutics development. 
Ultimate validation of the targets will, of course, require 
additional experiments such as protein expression, 
biochemical-assay development and animal 
studies to identify those with the most useful 
properties or inhibitors. 
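
The electronic comparison of genomes mentioned above amounts, in its simplest form, to a set difference over similarity-search results. The following sketch assumes that each gene of the target organism has already been searched against the other genomes and that hits are supplied as (query gene, subject genome, E-value) tuples; the function name, the gene identifiers and the 1e-5 cut-off are hypothetical.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject genome, E-value)


def organism_specific_genes(target_genes: Iterable[str],
                            hits: Iterable[Hit],
                            excluded_genomes: Iterable[str],
                            max_evalue: float = 1e-5) -> List[str]:
    """Return target genes with no significant hit in any excluded genome.

    Assumption: a hit counts as significant when its E-value is at or
    below max_evalue; weaker hits are ignored.
    """
    excluded = set(excluded_genomes)
    matched = {
        query for query, genome, evalue in hits
        if genome in excluded and evalue <= max_evalue
    }
    return sorted(set(target_genes) - matched)
```

With hypothetical genes `hp0001`, `hp0002` and `hp0003`, a strong E. coli hit for `hp0001` and only a weak one for `hp0002`, the call returns `hp0002` and `hp0003` as candidates. Such a list is only a starting point: absence of a detectable homolog depends on the threshold chosen and on how complete the comparison genomes are.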